Our team has recently been pushing on APM, and we need to improve our product's deep learning model along two dimensions: inference speed and on-disk size.

We currently run the model in the TensorFlow Lite format. We wanted to optimize it with XLA, the experimental compiler project that TensorFlow announced in 2017. According to the conclusions of the official examples, a model optimized with XLA/AOT can run 10%~200% faster than the same model in .pb format (with occasional exceptions) and take roughly 4x less space.

This post walks through the complete optimization process step by step, together with the pitfalls we hit and how we worked around them.

Step -1: Overview of compiling a model to AOT (ahead-of-time) code with XLA

  1. Compile tfcompile

  2. Freeze the model

  3. Write graph.config.pbtxt

  4. Write the bazel BUILD script

  5. Compile the platform-specific binaries (.o, .h)

  6. Write code that calls the AOT model

  7. Write the BUILD file

  8. Compile the final platform-specific artifact (.so)

    Environment

    • Ubuntu 18.04
    • Bazel 0.24
    • JDK 8
    • Android NDK
    • Android SDK

    Directory layout

    //tensorflow/compiler/aot/
    │ aot_only_var_handle_op.cc
    │ benchmark.cc
    │ benchmark.h
    │ benchmark_main.template
    │ benchmark_test.cc
    │ BUILD
    │ codegen.cc
    │ codegen.h
    │ codegen_test.cc
    │ codegen_test_h.golden
    │ codegen_test_o.golden
    │ compile.cc
    │ compile.h
    │ embedded_protocol_buffers.cc
    │ embedded_protocol_buffers.h
    │ flags.cc
    │ flags.h
    │ test.cc
    │ test_graph_tfadd.config.pbtxt
    │ test_graph_tfadd.pbtxt
    │ test_graph_tfunknownop.config.pbtxt
    │ test_graph_tfunknownop.pbtxt
    │ test_graph_tfunknownop2.config.pbtxt
    │ test_graph_tfunknownop3.config.pbtxt
    │ tfcompile.bzl
    │ tfcompile_main.cc
    ├─custom
    │ │ BUILD
    │ │ com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc
    │ │ com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h
    │ │ custom_interface.config.pbtxt
    │ │ custom_interface_lib.h
    │ │ custom_interface_tfcompile_function.o
    │ │ custom_interface_tfcompile_metadata.o
    │ │ debug.cc
    │ │ debug.h
    │ │ figure-65.png
    │ │ figure-66.jpg
    │ │ frozen_custom_010.pb
    │ │ input_image.py
    │ │ libcustom_interface.a
    │ │ libcustom_interface.pic.a
    │ │ libcustom_interface.so
    │ │ lib_custom_interface.so
    │ │ log.h
    │ │ log_stream.h
    │ │ out.h
    │ │ out_helper.o
    │ │ out_model.o
    │ │ predict_model.py
    │ │ Screenshot_67.jpg
    │ │ Screenshot_68.png
    │ │ tfcompile_h_o.py
    │ │ __init__.py
    │ │
    │ ├─arm64-v8a
    │ │ libcustom_interface.so
    │ │ lib_custom_interface.so
    │ │
    │ └─armeabi-v7a
    │ libcustom_interface.so
    │ lib_custom_interface.so
    └─tests
    BUILD
    make_test_graphs.py
    test_graph_tfadd.config.pbtxt
    test_graph_tfadd_with_ckpt.config.pbtxt
    test_graph_tfassert_eq.config.pbtxt
    test_graph_tfcond.config.pbtxt
    test_graph_tffunction.config.pbtxt
    test_graph_tfgather.config.pbtxt
    test_graph_tfmatmul.config.pbtxt
    test_graph_tfmatmulandadd.config.pbtxt
    test_graph_tfsplits.config.pbtxt
    test_graph_tftop_k.config.pbtxt
    test_graph_tfvariable.config.pbtxt
    test_graph_tfvariable_sequential_updates.config.pbtxt
    tfcompile_test.cc

Step 0: Compile tfcompile

  • Building tfcompile really just means building one part of the TensorFlow source tree, but that part still requires the dependencies of the whole project.

    1. Download the source

      git clone --recurse-submodules https://github.com/tensorflow/tensorflow

      The --recurse-submodules flag is required; it fetches the protobuf library that TensorFlow depends on.

    2. Configure TensorFlow

      cd ~/tensorflow
      ./configure

      Pay attention to a few of the configure prompts:

      You have bazel 0.25.0 installed.
      Please specify the location of python. [Default is C:\ProgramData\Anaconda3\python.exe]:
      Found possible Python library paths:
      C:\ProgramData\Anaconda3\lib\site-packages
      Please input the desired Python library path to use. Default is [C:\ProgramData\Anaconda3\lib\site-packages]
      Do you wish to build TensorFlow with XLA JIT support? [y/N]: Y
      XLA JIT support will be enabled for TensorFlow.
      Do you wish to build TensorFlow with ROCm support? [y/N]: N
      No ROCm support will be enabled for TensorFlow.
      Do you wish to build TensorFlow with CUDA support? [y/N]: N
      No CUDA support will be enabled for TensorFlow.
      Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is
      /arch:AVX]:
      Would you like to override eigen strong inline for some C++ compilation to reduce the compilation time? [Y/n]: N
      Not overriding eigen strong inline, some compilations could take more than 20 mins.
      Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .ba
      zelrc for more details.
      --config=mkl # Build with MKL support.
      --config=monolithic # Config for mostly static monolithic build.
      --config=gdr # Build with GDR support.
      --config=verbs # Build with libverbs support.
      --config=ngraph # Build with Intel nGraph support.
      --config=numa # Build with NUMA support.
      --config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
      --config=v2 # Build TensorFlow 2.x instead of 1.x.
      Preconfigured Bazel build configs to DISABLE default on features:
      --config=noaws # Disable AWS S3 filesystem support.
      --config=nogcp # Disable GCP support.
      --config=nohdfs # Disable HDFS support.
      --config=noignite # Disable Apache Ignite support.
      --config=nokafka # Disable Apache Kafka support.
      --config=nonccl # Disable NVIDIA NCCL support.
      Configuration finished

      The most important prompt is "Do you wish to build TensorFlow with XLA JIT support? [y/N]". Answer Y here; this is the key to using XLA.

    3. Build tfcompile

      The Bazel entry point for tfcompile is //tensorflow/compiler/aot:tfcompile.

      Run the build command:

      bazel build //tensorflow/compiler/aot:tfcompile

    Pitfalls we hit in this step:

    • Different targets are compiled concurrently, so "package not found" errors come up fairly often while building the source; simply re-running the same build command a few times usually gets past them.
    • The build first downloads its various dependencies. Occasionally a dependency has published a new version upstream while the SHA256 it is verified against is hard-coded in the download script, so the check fails; use the error log to locate the download script and update the corresponding SHA256 value.
    • Bazel caches downloaded dependencies, so you do not have to worry about losing them after a failed build (as long as you do not run bazel clean).
    • If tfcompile refuses to build no matter which network you are on, try bazel clean and re-run the command, or run bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package from the repo root to download and build all dependencies first; once that succeeds, the tfcompile build goes much more smoothly.

Step 1: Freeze the model

In this step the graph and the checkpoints are frozen into a .pb model; this is the raw material from which the binaries are generated later.

This step is easy; there are no pitfalls here ;)
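
For reference, here is a minimal sketch (TensorFlow 1.x) of freezing a graph and checkpoint with the freeze_graph tool. The graph and checkpoint paths are placeholders, and the output node name is the fetch node used in the next step; adapt them to your own model.

    from tensorflow.python.tools import freeze_graph

    # Fold the checkpoint weights into the graph and write a single frozen .pb.
    # All paths are placeholders; the output node matches the fetch node in
    # graph.config.pbtxt from the next step.
    freeze_graph.freeze_graph(
        input_graph="graph.pbtxt",               # GraphDef exported during training
        input_saver="",
        input_binary=False,                      # graph.pbtxt is in text format
        input_checkpoint="model.ckpt-010",       # checkpoint holding the trained weights
        output_node_names="MobilenetV2/Predictions/Reshape_1",
        restore_op_name="save/restore_all",
        filename_tensor_name="save/Const:0",
        output_graph="frozen_custom_010.pb",     # the frozen model fed to tfcompile
        clear_devices=True,
        initializer_nodes="")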

Step 2: graph.config.pbtxt

This step produces a description of the graph: the name of its input node, the shape of the input tensor, the name of its output node, and so on.

There are two ways to produce this description:

  1. Generate it automatically with tooling from the source tree (requires writing a bit of code)

    The tf2xla_pb2.py module in the source tree can be used to build the description programmatically (a hedged sketch is given at the end of this step).

    Here we introduce another way that is more intuitive and quicker:

  2. Determine the inputs and outputs with a visualization tool

    NETRON

    After importing the model into this tool you can expand the description of every node in the graph, read off the input and output node descriptions, and write the corresponding config file:

    # Each feed is a positional input argument for the generated function. The order
    # of each entry matches the order of each input argument. Here "input" is the
    # name of the placeholder node defined in the graph.
    feed {
      id { node_name: "input" }
      shape {
        dim { size: 1 }
        dim { size: 160 }
        dim { size: 160 }
        dim { size: 3 }
      }
    }
    # Each fetch is a positional output argument for the generated function. The order
    # of each entry matches the order of each output argument. Here
    # "MobilenetV2/Predictions/Reshape_1" refers to the output node of the graph.
    fetch {
      id { node_name: "MobilenetV2/Predictions/Reshape_1" }
    }

Save this file as graph.config.pbtxt.

This step is complete.

WARNING: comments in this description file may only use # as the marker; anything else will fail to compile, and the error will not point to the offending location.
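
For completeness, here is a hedged sketch of option 1 above: generating the config programmatically. It assumes the tf2xla_pb2 module generated from tensorflow/compiler/tf2xla/tf2xla.proto is importable from your checkout (you may need to generate it with protoc first); the node names and shape are the same ones used in the hand-written file.

    from google.protobuf import text_format
    from tensorflow.compiler.tf2xla import tf2xla_pb2  # generated from tf2xla.proto

    config = tf2xla_pb2.Config()

    # One feed per model input: the node name plus the static input shape.
    feed = config.feed.add()
    feed.id.node_name = "input"
    for size in (1, 160, 160, 3):
        feed.shape.dim.add().size = size

    # One fetch per model output: the node name is enough.
    fetch = config.fetch.add()
    fetch.id.node_name = "MobilenetV2/Predictions/Reshape_1"

    with open("graph.config.pbtxt", "w") as f:
        f.write(text_format.MessageToString(config))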

Step 3: Write the bazel BUILD script

In this step we write the build script. The configuration is short, but a few details deserve attention:

  • We recommend creating your own folder under //tensorflow/compiler/aot and putting the artifacts generated above into it; everything below assumes we are working in //tensorflow/compiler/aot/custom.

  • Create a BUILD file in the custom directory:

    load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
    tf_library(
    name = "custom_interface",
    cpp_class = "Classifier",
    graph = "frozen_custom_010.pb",
    config = "graph.config.pbtxt",
    )

    What the attributes mean:

    • name: the name of the artifacts the build will generate.

    • cpp_class: the name given to the class in the generated C++ header (.h); you can prepend namespaces, such as foo::bar::Classifier.

    • graph: the frozen .pb graph produced in the earlier step.

    • config: the graph description file produced in the previous step.

      This step is complete.

Step 4: Compile the platform-specific binaries (.o, .h)

Run the command: bazel build --verbose_failures //tensorflow/compiler/aot/custom:custom_interface

Here custom_interface corresponds to the name defined in the previous step.

Since tfcompile has already been built and this step only reuses part of that output, there are no real pitfalls.

There is also a second way to produce the artifacts of this step: locate the tfcompile executable under the bazel-bin directory created when tfcompile was built,

then invoke it directly: tfcompile --graph=frozen_custom_010.pb --config=graph.config.pbtxt --cpp_class="Classifier"

Here is the full tfcompile usage text; it includes the flags for targeting a specific platform ABI:

tfcompile performs ahead-of-time compilation of a TensorFlow graph,
resulting in an object file compiled for your target architecture, and a
header file that gives access to the functionality in the object file.
A typical invocation looks like this:
$ tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"
usage: ./tfcompile
Flags:
--graph="" string Input GraphDef file. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--config="" string Input file containing Config proto. If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
--dump_fetch_nodes=false bool If set, only flags related to fetches are processed, and the resulting fetch nodes will be dumped to stdout in a comma-separated list. Typically used to format arguments for other tools, e.g. freeze_graph.
--target_triple="x86_64-pc-linux" string Target platform, similar to the clang -target flag. The general format is <arch><sub>-<vendor>-<sys>-<abi>. http://clang.llvm.org/docs/CrossCompilation.html#target-triple.
--target_cpu="" string Target cpu, similar to the clang -mcpu flag. http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
--target_features="" string Target features, e.g. +avx2, +neon, etc.
--entry_point="entry" string Name of the generated function. If multiple generated object files will be linked into the same binary, each will need a unique entry point.
--cpp_class="" string Name of the generated C++ class, wrapping the generated function. The syntax of this flag is [[<optional_namespace>::],...]<class_name>. This mirrors the C++ syntax for referring to a class, where multiple namespaces may precede the class name, separated by double-colons. The class will be generated in the given namespace(s), or if no namespaces are given, within the global namespace.
--out_function_object="out_model.o" string Output object file containing the generated function for the TensorFlow model.
--out_header="out.h" string Output header file name.
--out_metadata_object="out_helper.o" string Output object file name containing optional metadata for the generated function.
--out_session_module="" string Output session module proto.
--gen_name_to_index=false bool Generate name-to-index data for Lookup{Arg,Result}Index methods.
--gen_program_shape=false bool Generate program shape data for the ProgramShape method.
--xla_generate_hlo_graph="" string HLO modules matching this regex will be dumped to a .dot file throughout various stages in compilation.
--xla_hlo_graph_addresses=false bool With xla_generate_hlo_graph, show addresses of HLO ops in graph dump.
--xla_hlo_graph_path="" string With xla_generate_hlo_graph, dump the graphs into this path.
--xla_hlo_dump_as_graphdef=false bool Dump HLO graphs as TensorFlow GraphDefs.
--xla_hlo_graph_sharding_color=false bool Assign colors based on sharding assignments when generating the HLO graphs.
--xla_hlo_tfgraph_device_scopes=false bool When generating TensorFlow HLO graphs, if the HLO instructions are assigned to a specific device, prefix the name scope with "devX" with X being the device ordinal.
--xla_log_hlo_text="" string HLO modules matching this regex will be dumped to LOG(INFO).
--xla_generate_hlo_text_to="" string Dump all HLO modules as text into the provided directory path.
--xla_enable_fast_math=true bool Enable unsafe fast-math optimizations in the compiler; this may produce faster code at the expense of some accuracy.
--xla_llvm_enable_alias_scope_metadata=true bool In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
--xla_llvm_enable_noalias_metadata=true bool In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
--xla_llvm_enable_invariant_load_metadata=true bool In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
--xla_llvm_disable_expensive_passes=false bool In LLVM-based backends, disable a custom set of expensive optimization passes.
--xla_backend_optimization_level=3 int32 Numerical optimization level for the XLA compiler backend.
--xla_disable_hlo_passes="" string Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
--xla_embed_ir_in_executable=false bool Embed the compiler IR as a string in the executable.
--xla_dump_ir_to="" string Dump the compiler IR into this directory as individual files.
--xla_eliminate_hlo_implicit_broadcast=true bool Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
--xla_cpu_multi_thread_eigen=true bool When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
--xla_gpu_cuda_data_dir="./cuda_sdk_lib" string If non-empty, speficies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
--xla_gpu_ftz=false bool If true, flush-to-zero semantics are enabled in the code generated for GPUs.
--xla_gpu_disable_multi_streaming=false bool If true, multi-streaming in the GPU backend is disabled.
--xla_gpu_max_kernel_unroll_factor=4 int32 Specify the maximum kernel unroll factor for the GPU backend.
--xla_dump_optimized_hlo_proto_to="" string Dump Hlo after all hlo passes are executed as proto binary into this directory.
--xla_dump_unoptimized_hlo_proto_to="" string Dump HLO before any hlo passes are executed as proto binary into this directory.
--xla_dump_per_pass_hlo_proto_to="" string Dump HLO after each pass as an HloProto in binary file format into this directory.
--xla_test_all_output_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
--xla_test_all_input_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for every possible layouts
--xla_hlo_profile=false bool Instrument the computation to collect per-HLO cycle counts
--xla_dump_computations_to="" string Dump computations that XLA executes into the provided directory path
--xla_dump_executions_to="" string Dump parameters and results of computations that XLA executes into the provided directory path
--xla_backend_extra_options="" string Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
--xla_reduce_precision="" string Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and matissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations. The NAMES string and its preceding ';' may be omitted. This option may be repeated to define multiple sets of added reduce-precision operations.
--xla_gpu_use_cudnn_batchnorm=false bool Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs.
--xla_cpu_use_mkl_dnn=false bool Generate calls to MKL-DNN in the CPU backend.
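
If you go the direct-invocation route, a small wrapper script (in the spirit of the tfcompile_h_o.py file in the directory listing above) might look like the sketch below. The tfcompile path and the commented-out Android target triple are assumptions; the output file names match the .h/.o artifacts that appear under //tensorflow/compiler/aot/custom.

    import subprocess

    # Assumed location of the tfcompile binary built in Step 0.
    TFCOMPILE = "bazel-bin/tensorflow/compiler/aot/tfcompile"

    cmd = [
        TFCOMPILE,
        "--graph=frozen_custom_010.pb",
        "--config=graph.config.pbtxt",
        "--cpp_class=Classifier",
        "--out_header=custom_interface_lib.h",
        "--out_function_object=custom_interface_tfcompile_function.o",
        "--out_metadata_object=custom_interface_tfcompile_metadata.o",
        # "--target_triple=armv7-none-linux-androideabi",  # assumed value; see the
        #                                                  # --target_triple flag above
    ]
    subprocess.run(cmd, check=True)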

In the end three artifacts are generated:

(Screenshots: the generated artifacts)
  • cat custom_interface_lib.h

    // Generated by tfcompile, the TensorFlow graph compiler. DO NOT EDIT!
    //
    // This header was generated via ahead-of-time compilation of a TensorFlow
    // graph. An object file corresponding to this header was also generated.
    // This header gives access to the functionality in that object file.
    //
    // clang-format off
    #ifndef TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_ // NOLINT(build/header_guard)
    #define TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_ // NOLINT(build/header_guard)
    #include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
    #include "tensorflow/core/platform/types.h"
    namespace Eigen { struct ThreadPoolDevice; }
    namespace xla { class ExecutableRunOptions; }
    // (Implementation detail) Entry point to the function in the object file.
    extern "C" void __xla_tensorflow_compiler_aot_custom__custom_interface(
    void* result, const ::xla::ExecutableRunOptions* run_options,
    const void** args, void** temps, tensorflow::int64* profile_counters);
    // Classifier represents a computation previously specified in a
    // TensorFlow graph, now compiled into executable code. This extends the generic
    // XlaCompiledCpuFunction class with statically type-safe arg and result
    // methods. Usage example:
    //
    // Classifier computation;
    // // ...set args using computation.argN methods
    // CHECK(computation.Run());
    // // ...inspect results using computation.resultN methods
    //
    // The Run method invokes the actual computation, with inputs read from arg
    // buffers, and outputs written to result buffers. Each Run call may also use
    // a set of temporary buffers for the computation.
    //
    // By default each instance of this class manages its own arg, result and temp
    // buffers. The AllocMode constructor parameter may be used to modify the
    // buffer allocation strategy.
    //
    // Under the default allocation strategy, this class is thread-compatible:
    // o Calls to non-const methods require exclusive access to the object.
    // o Concurrent calls to const methods are OK, if those calls are made while it
    // is guaranteed that no thread may call a non-const method.
    //
    // The logical function signature is:
    // (arg0: f32[1,160,160,3]) -> (f32[1,30])
    //
    // Memory stats:
    // arg bytes total: 307200
    // arg bytes aligned: 307200
    // temp bytes total: 4143008
    // temp bytes aligned: 4143104
    class Classifier final : public tensorflow::XlaCompiledCpuFunction {
    public:
    // Number of input arguments for the compiled computation.
    static constexpr size_t kNumArgs = 1;
    // Byte size of each argument buffer. There are kNumArgs entries.
    static const ::tensorflow::int64 ArgSize(::tensorflow::int32 index) {
    return BufferInfos()[ArgIndexToBufferIndex()[index]].size();
    }
    // Returns static data used to create an XlaCompiledCpuFunction.
    static const tensorflow::XlaCompiledCpuFunction::StaticData& StaticData() {
    static XlaCompiledCpuFunction::StaticData* kStaticData = [](){
    XlaCompiledCpuFunction::StaticData* data =
    new XlaCompiledCpuFunction::StaticData;
    set_static_data_raw_function(data, __xla_tensorflow_compiler_aot_custom__custom_interface);
    set_static_data_buffer_infos(data, BufferInfos());
    set_static_data_num_buffers(data, kNumBuffers);
    set_static_data_arg_index_table(data, ArgIndexToBufferIndex());
    set_static_data_num_args(data, kNumArgs);
    set_static_data_result_index(data, kResultIndex);
    set_static_data_arg_names(data, StaticArgNames());
    set_static_data_result_names(data, StaticResultNames());
    set_static_data_program_shape(data, StaticProgramShape());
    set_static_data_hlo_profile_printer_data(
    data, StaticHloProfilePrinterData());
    return data;
    }();
    return *kStaticData;
    }
    Classifier(AllocMode alloc_mode =
    AllocMode::ARGS_VARIABLES_RESULTS_PROFILES_AND_TEMPS)
    : XlaCompiledCpuFunction(StaticData(), alloc_mode) {}
    Classifier(const Classifier&) = delete;
    Classifier& operator=(const Classifier&) = delete;
    // Arg methods for managing input buffers. Buffers are in row-major order.
    // There is a set of methods for each positional argument, with the following
    // general form:
    //
    // void set_argN_data(void* data)
    // Sets the buffer of type T for positional argument N. May be called in
    // any AllocMode. Must be called before Run to have an affect. Must be
    // called in AllocMode::RESULTS_PROFILES_AND_TEMPS_ONLY for each positional
    // argument, to set the argument buffers.
    //
    // T* argN_data()
    // Returns the buffer of type T for positional argument N.
    //
    // T& argN(...dim indices...)
    // Returns a reference to the value of type T for positional argument N,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    void set_arg0_data(const void* data) {
    set_arg_data(0, data);
    }
    float* arg0_data() {
    return static_cast<float*>(arg_data(0));
    }
    float& arg0(size_t dim0, size_t dim1, size_t dim2, size_t dim3) {
    return (*static_cast<float(*)[1][160][160][3]>(
    arg_data(0)))[dim0][dim1][dim2][dim3];
    }
    const float* arg0_data() const {
    return static_cast<const float*>(arg_data(0));
    }
    const float& arg0(size_t dim0, size_t dim1, size_t dim2, size_t dim3) const {
    return (*static_cast<const float(*)[1][160][160][3]>(
    arg_data(0)))[dim0][dim1][dim2][dim3];
    }
    // Result methods for managing output buffers. Buffers are in row-major order.
    // Must only be called after a successful Run call. There is a set of methods
    // for each positional result, with the following general form:
    //
    // T* resultN_data()
    // Returns the buffer of type T for positional result N.
    //
    // T& resultN(...dim indices...)
    // Returns a reference to the value of type T for positional result N,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    //
    // Unlike the arg methods, there is no set_resultN_data method. The result
    // buffers are managed internally, and may change after each call to Run.
    float* result0_data() {
    return static_cast<float*>(result_data(0));
    }
    float& result0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][30]>(
    result_data(0)))[dim0][dim1];
    }
    const float* result0_data() const {
    return static_cast<const float*>(result_data(0));
    }
    const float& result0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][30]>(
    result_data(0)))[dim0][dim1];
    }
    // Methods for managing variable buffers. Buffers are in row-major order.
    //
    // For read-write variables we generate the following methods:
    //
    // void set_var_X_data(T* data)
    // Sets the buffer for variable X. Must be called before Run if the
    // allocation mode is RESULTS_PROFILES_AND_TEMPS_ONLY.
    //
    // T* var_X_data()
    // Returns the buffer of type T for variable X. If the allocation mode is
    // RESULTS_PROFILES_AND_TEMPS_ONLY then this buffer is the same as the
    // buffer passed to set_var_X_data.
    //
    // T& var_X(...dim indices...)
    // Returns a reference to the value of type T for variable X,
    // with dim indices specifying which value. No bounds checking is performed
    // on dim indices.
    //
    // For readonly variables we generate the same set of methods, except that we
    // use `const T` instead of `T`. We use `const T` to avoid erasing the
    // constness of the buffer passed to `set_var_X_data` but the underlying
    // buffer is not const (and thus the const can be safely const-cast'ed away)
    // unless `set_var_X_data` is called with a pointer to constant storage.
    private:
    // Number of buffers for the compiled computation.
    static constexpr size_t kNumBuffers = 50;
    static const ::xla::cpu_function_runtime::BufferInfo* BufferInfos() {
    static const ::xla::cpu_function_runtime::BufferInfo
    kBufferInfos[kNumBuffers] = {
    ::xla::cpu_function_runtime::BufferInfo({2293760ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({1228802ULL, 0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({614400ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({602112ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({301056ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({172032ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({98304ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({73728ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({55296ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({36864ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({24576ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({12288ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6912ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({6144ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({2048ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({481ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({33ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({16ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({19ULL, ~0ULL}),
    ::xla::cpu_function_runtime::BufferInfo({16571521ULL, ~0ULL})
    };
    return kBufferInfos;
    }
    static const ::tensorflow::int32* ArgIndexToBufferIndex() {
    static constexpr ::tensorflow::int32 kArgIndexToBufferIndex[kNumArgs] = {
    1
    };
    return kArgIndexToBufferIndex;
    }
    // The 0-based index of the result tuple in the temporary buffers.
    static constexpr size_t kResultIndex = 38;
    // Array of names of each positional argument, terminated by nullptr.
    static const char** StaticArgNames() {
    return nullptr;
    }
    // Array of names of each positional result, terminated by nullptr.
    static const char** StaticResultNames() {
    return nullptr;
    }
    // Shape of the args and results.
    static const ::xla::ProgramShapeProto* StaticProgramShape() {
    static const ::xla::ProgramShapeProto* kShape = nullptr;
    return kShape;
    }
    // Metadata that can be used to pretty-print profile counters.
    static const ::xla::HloProfilePrinterData* StaticHloProfilePrinterData() {
    static const ::xla::HloProfilePrinterData* kHloProfilePrinterData =
    nullptr;
    return kHloProfilePrinterData;
    }
    };
    #endif // TFCOMPILE_GENERATED___xla_tensorflow_compiler_aot_custom__custom_interface_H_
    // clang-format on

You can see that the concrete graph has been converted into the corresponding runtime instructions.

Each BufferInfo entry marks a concrete runtime buffer block in the .o file. The memory stats in the header also check out: for example, the arg bytes total of 307200 is exactly 1 x 160 x 160 x 3 floats at 4 bytes each.

Step 5: Write code that calls the AOT model

In this step you need to think about which platform the final .so will run on. Since we use it on mobile (Android), talking to the C++ side requires JNI support, so at this point the TensorFlow source has to be re-configured (./configure) with Android SDK/NDK support enabled; choose the SDK/NDK target versions yourself.

The C++ calling code below is written to conform to the JNI conventions.

In the Java project, declare the corresponding native method and generate the matching JNI header:

package com.qihoo.cleandroid.sdk.imageclassfier.core.classfier.process;

/**
 * Created by zhanghongxin on 2019/7/22.
 */
public class CustomClassifier {

    private static final String LIBNAME = "custom_interface";

    private CustomClassifier() {}

    /**
     * Load the native library containing the AOT-compiled model.
     */
    static boolean init() {
        try {
            System.loadLibrary(LIBNAME);
            return true;
        } catch (UnsatisfiedLinkError e) {
            System.err.println("custom_interface: failed to load native library: " + e.getMessage());
            return false;
        }
    }

    static {
        init();
    }

    public static native void getPredictResult(float[][][][] input, float[][] output, int inputSize, int outputSize);
}

This generates the header com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h:

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier */

#ifndef _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#define _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
 * Method:    getPredictResult
 * Signature: ([[[[F[[FII)V
 */
JNIEXPORT void JNICALL Java_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier_getPredictResult
  (JNIEnv *, jclass, jobjectArray, jobjectArray, jint, jint);

#ifdef __cplusplus
}
#endif
#endif

Next, write the C++ implementation that actually calls the model, com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc:

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <iostream>
#include <cstdio>
#include <jni.h>
#include <android/log.h>

#include "custom_interface_lib.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

float* run(float* input, float* output, int input_size, int output_size) {
    std::cout << "Load .so SUCCESS" << std::endl;
    Eigen::ThreadPool tp(std::thread::hardware_concurrency());
    Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());
    Classifier classifier;
    classifier.set_thread_pool(&device);
    // Copy the caller's input buffer into the model's arg0 buffer and run.
    std::copy(input, input + input_size, classifier.arg0_data());
    auto ok = classifier.Run();
    if (not ok) std::cout << "NOT OK" << std::endl;
    //
    // std::cout << "input:";
    // std::cout << input << std::endl;
    //
    // std::cout << "input_size:";
    // std::cout << input_size << std::endl;
    //
    // std::cout << "classifier.arg0_data():";
    // std::cout << classifier.arg0_data() << std::endl;
    //
    // std::cout << "output:";
    // std::cout << output << std::endl;
    //
    // std::cout << "output_size:";
    // std::cout << output_size << std::endl;
    //
    // std::cout << "result0_data():";
    // std::cout << classifier.result0_data() << std::endl;
    //
    // for (int i = 0; i < 30; i++) {
    //     std::cout << "restul0_";
    //     std::cout << i;
    //     std::cout << " : ";
    //     std::cout << classifier.result0(0, i) << std::endl;
    //     __android_log_print(ANDROID_LOG_INFO, "NATIVE", "~~~~~~~~~OUTPUT== %f~~~~~~~~~~~~~~~\n", classifier.result0(0, i));
    // }
    // Copy the model's result buffer back into the caller's output buffer.
    std::copy(classifier.result0_data(), classifier.result0_data() + output_size, output);
    return output;
}

#ifndef _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#define _Included_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier
#ifdef __cplusplus
extern "C" {
#endif

JNIEXPORT void JNICALL Java_com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier_getPredictResult
  (JNIEnv *env, jobject obj, jobjectArray inputArray, jobjectArray outputArray, jint inputSize, jint outputSize) {
    jboolean isCopy = JNI_FALSE;
    jint rows = env->GetArrayLength(inputArray);
    // __android_log_print(ANDROID_LOG_INFO, "NATIVE", "~~~~~~~~~inputRows== %d~~~~~~~~~~~~~~~\n", rows);
    // Unwrap the float[1][160][160][3] input down to the innermost float[].
    jobjectArray tempInputArray = (jobjectArray) env->GetObjectArrayElement(inputArray, 0);
    jobjectArray tTempInputArray = (jobjectArray) env->GetObjectArrayElement(tempInputArray, 0);
    jobjectArray tTTempInputArray = (jobjectArray) env->GetObjectArrayElement(tTempInputArray, 0);
    jfloat* input = env->GetFloatArrayElements((jfloatArray) tTTempInputArray, 0);
    // Unwrap the float[1][30] output buffer.
    jobjectArray tempOutputArray = (jobjectArray) env->GetObjectArrayElement(outputArray, 0);
    jfloat* output = env->GetFloatArrayElements((jfloatArray) tempOutputArray, 0);

    jfloat* resultPointer = run(input, output, inputSize, outputSize);

    // jclass floatArrayClz = env->FindClass("[[F");
    // if (floatArrayClz == NULL) return NULL;
    // outputArray = env->NewObjectArray(outputSize, floatArrayClz, NULL);
    // if (outputArray == NULL) return NULL;
    // Copy the results back into the Java output array.
    for (int i = 0; i < 1; i++) {
        jfloat temp[outputSize];
        jfloatArray floatArray = env->NewFloatArray(outputSize);
        for (int j = 0; j < outputSize; j++) {
            temp[j] = *(resultPointer + j);
        }
        env->SetFloatArrayRegion(floatArray, 0, outputSize, temp);
        env->SetObjectArrayElement(outputArray, i, floatArray);
        env->DeleteLocalRef(floatArray);
    }
}

#ifdef __cplusplus
}
#endif
#endif

This step is complete.

Step 6: Write the BUILD file

The BUILD file edited in this step is the same file created in Step 3.

cc_library(
    name = "library",
    hdrs = ["custom_interface_lib.h"],
    srcs = [
        "custom_interface_tfcompile_function.o",
        "custom_interface_tfcompile_metadata.o",
    ],
)

cc_binary(
    name = "libcustom_interface.so",
    srcs = [
        "com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.h",
        "com_qihoo_cleandroid_sdk_imageclassfier_core_classfier_process_CustomClassifier.cc",
    ],
    deps = [
        ":library",
        "//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
        "//tensorflow/core:framework_lite",
        "//tensorflow/compiler/xla:cpu_function_runtime",
        "//tensorflow/compiler/xla/service/cpu:runtime_conv2d",
        "//tensorflow/compiler/xla/service/cpu:runtime_matmul",
        "//third_party/eigen3",
    ],
    linkopts = [
        "-landroid",
        "-shared",
    ],
    linkshared = 1,
    linkstatic = 1,
    copts = ["-fPIC"],
)

The name in cc_binary is the name of the generated .so, and it should follow the JNI naming convention (System.loadLibrary("custom_interface") expects a library named libcustom_interface.so).

Note that to produce a .so shared library you must set linkshared = 1 and linkstatic = 1.

Step 7: Compile the final platform-specific artifact (.so)

Run the command:

bazel build -c opt //tensorflow/compiler/aot/custom:libcustom_interface.so \
    --crosstool_top=//external:android/crosstool \
    --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
    --cpu=armeabi-v7a

Use --cpu=xxx to control which platform ABI the .so is built for (e.g. armeabi-v7a or arm64-v8a, matching the ABI directories in the layout above).

With that, all the build steps are complete.

Experimental results

Comparison after wiring XLA acceleration into the mobile app (strictly speaking it is not a fair comparison, because the variables were not controlled):

(Screenshots: time taken by the Lite build vs. the XLA/AOT build)

The screenshots above compare the time it took to run predictions on the same 174 images, on the same device model, with the Lite build and with the XLA/AOT build.

Lite clearly has a big speed advantage here. That does not mean XLA does nothing; it just happens that the convolutional network used in our graph is not a good fit for XLA's optimizations in this scenario. The official speed comparison uses JIT, and the JIT figures measured at training time do correspond to the AOT figures seen at run time, which happens to reflect the same relationship.

A few official demo figures are attached below:


Reference